{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# waimai_10k 说明\n",
    "0. **下载地址：** [Github](https://github.com/SophonPlus/ChineseNlpCorpus/raw/master/datasets/waimai_10k/waimai_10k.csv)\n",
    "1. **数据概览：** 某外卖平台收集的用户评价，正向 4000 条，负向 约 8000 条\n",
    "2. **推荐实验：** 情感/观点/评论 倾向性分析\n",
    "2. **数据来源：** 某外卖平台\n",
    "3. **原数据集：** [中文短文本情感分析语料 外卖评价](https://download.csdn.net/download/cstkl/10236683)，网上搜集，具体作者、来源不详\n",
    "4. **加工处理：**\n",
    "    1. 将原来 2 个文件整合到 1 个文件中\n",
    "    2. 去重"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [],
   "source": [
    "path = 'waimai_10k_文件夹_所在_路径'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 1. waimai_10k.csv"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 加载数据"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "评论数目（总体）：11987\n",
      "评论数目（正向）：4000\n",
      "评论数目（负向）：7987\n"
     ]
    }
   ],
   "source": [
    "pd_all = pd.read_csv(path + 'waimai_10k.csv')\n",
    "\n",
    "print('评论数目（总体）：%d' % pd_all.shape[0])\n",
    "print('评论数目（正向）：%d' % pd_all[pd_all.label==1].shape[0])\n",
    "print('评论数目（负向）：%d' % pd_all[pd_all.label==0].shape[0])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 字段说明\n",
    "\n",
    "| 字段 | 说明 |\n",
    "| ---- | ---- |\n",
    "| label | 1 表示正向评论，0 表示负向评论 |\n",
    "| review | 评论内容 |"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>label</th>\n",
       "      <th>review</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>25</th>\n",
       "      <td>1</td>\n",
       "      <td>送餐特别快,态度也好,辛苦啦</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6632</th>\n",
       "      <td>0</td>\n",
       "      <td>点了热带雨林披萨+饮料，和BBQ鸡肉披萨+饮料，送来的是两个奥尔良披萨+两个银耳冰粥，冰凉冰...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8849</th>\n",
       "      <td>0</td>\n",
       "      <td>难吃!!!油死了，味道烂</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11114</th>\n",
       "      <td>0</td>\n",
       "      <td>今天菜太咸，连着定了3天吃，一天比一天难吃。</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11661</th>\n",
       "      <td>0</td>\n",
       "      <td>送的太慢了，菜都凉了。</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9571</th>\n",
       "      <td>0</td>\n",
       "      <td>没有满减！</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10614</th>\n",
       "      <td>0</td>\n",
       "      <td>差评！定的时间是12点一刻，结果刚11点就送来了！果断退单。送餐前不看时间吗？</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7585</th>\n",
       "      <td>0</td>\n",
       "      <td>羊肉串太咸，还有些不新鲜。鸡心和鸡胗烤的太老</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6919</th>\n",
       "      <td>0</td>\n",
       "      <td>快递员挺好，速度挺快</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3192</th>\n",
       "      <td>1</td>\n",
       "      <td>小炒肉卷饼好辣~</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10224</th>\n",
       "      <td>0</td>\n",
       "      <td>送来的时候都凉了,味道一般,鲜果西米露就两口的量,鲜果就是一块西瓜一个西瓜籽</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7295</th>\n",
       "      <td>0</td>\n",
       "      <td>没放糖，没放奶油，好难喝</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>275</th>\n",
       "      <td>1</td>\n",
       "      <td>他家的奶茶超级好喝。。。</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8378</th>\n",
       "      <td>0</td>\n",
       "      <td>黑椒牛柳饭送成大排饭</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5879</th>\n",
       "      <td>0</td>\n",
       "      <td>一个半小时，可以</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7523</th>\n",
       "      <td>0</td>\n",
       "      <td>订单满减后应该是24，送过来要收我原价39？你搞笑呐，还少听加多宝！我管你什么美食送的还是你...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6590</th>\n",
       "      <td>0</td>\n",
       "      <td>真心也忒慢了，其他都还成</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1703</th>\n",
       "      <td>1</td>\n",
       "      <td>非常划算，很好</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5345</th>\n",
       "      <td>0</td>\n",
       "      <td>首选是得吐槽一下这家的速度,一个半小时起,然后卷饼包装很不错,酱香鸡肉的比较赞,飘香肘子一般...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1674</th>\n",
       "      <td>1</td>\n",
       "      <td>离我们远点55分钟送到的，可以理解，饼和粥都不错</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "       label                                             review\n",
       "25         1                                     送餐特别快,态度也好,辛苦啦\n",
       "6632       0  点了热带雨林披萨+饮料，和BBQ鸡肉披萨+饮料，送来的是两个奥尔良披萨+两个银耳冰粥，冰凉冰...\n",
       "8849       0                                       难吃!!!油死了，味道烂\n",
       "11114      0                             今天菜太咸，连着定了3天吃，一天比一天难吃。\n",
       "11661      0                                        送的太慢了，菜都凉了。\n",
       "9571       0                                              没有满减！\n",
       "10614      0            差评！定的时间是12点一刻，结果刚11点就送来了！果断退单。送餐前不看时间吗？\n",
       "7585       0                             羊肉串太咸，还有些不新鲜。鸡心和鸡胗烤的太老\n",
       "6919       0                                         快递员挺好，速度挺快\n",
       "3192       1                                           小炒肉卷饼好辣~\n",
       "10224      0             送来的时候都凉了,味道一般,鲜果西米露就两口的量,鲜果就是一块西瓜一个西瓜籽\n",
       "7295       0                                       没放糖，没放奶油，好难喝\n",
       "275        1                                       他家的奶茶超级好喝。。。\n",
       "8378       0                                         黑椒牛柳饭送成大排饭\n",
       "5879       0                                           一个半小时，可以\n",
       "7523       0  订单满减后应该是24，送过来要收我原价39？你搞笑呐，还少听加多宝！我管你什么美食送的还是你...\n",
       "6590       0                                       真心也忒慢了，其他都还成\n",
       "1703       1                                            非常划算，很好\n",
       "5345       0  首选是得吐槽一下这家的速度,一个半小时起,然后卷饼包装很不错,酱香鸡肉的比较赞,飘香肘子一般...\n",
       "1674       1                           离我们远点55分钟送到的，可以理解，饼和粥都不错"
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pd_all.sample(20)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 2. 构造平衡语料"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [],
   "source": [
    "pd_positive = pd_all[pd_all.label==1]\n",
    "pd_negative = pd_all[pd_all.label==0]\n",
    "\n",
    "def get_balance_corpus(corpus_size, corpus_pos, corpus_neg):\n",
    "    sample_size = corpus_size // 2\n",
    "    pd_corpus_balance = pd.concat([corpus_pos.sample(sample_size, replace=corpus_pos.shape[0]<sample_size), \\\n",
    "                                   corpus_neg.sample(sample_size, replace=corpus_neg.shape[0]<sample_size)])\n",
    "    \n",
    "    print('评论数目（总体）：%d' % pd_corpus_balance.shape[0])\n",
    "    print('评论数目（正向）：%d' % pd_corpus_balance[pd_corpus_balance.label==1].shape[0])\n",
    "    print('评论数目（负向）：%d' % pd_corpus_balance[pd_corpus_balance.label==0].shape[0])    \n",
    "    \n",
    "    return pd_corpus_balance"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "评论数目（总体）：4000\n",
      "评论数目（正向）：2000\n",
      "评论数目（负向）：2000\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>label</th>\n",
       "      <th>review</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>10436</th>\n",
       "      <td>0</td>\n",
       "      <td>难吃～石锅拌饭居然没酱～而且刚好晚了29分钟</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10468</th>\n",
       "      <td>0</td>\n",
       "      <td>等了很久，没关系，毕竟还在约定时间内，可是最让我忍不了的是真的很一般，个人口味吧，反正不和我...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1643</th>\n",
       "      <td>1</td>\n",
       "      <td>嗯，纸袋比较高大上</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8723</th>\n",
       "      <td>0</td>\n",
       "      <td>海参怎么是生的，没法吃，郁闷</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2431</th>\n",
       "      <td>1</td>\n",
       "      <td>送餐很快，送餐人员很热情！～</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5121</th>\n",
       "      <td>0</td>\n",
       "      <td>不如以前好吃，肘子都有味儿了！哎！</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10565</th>\n",
       "      <td>0</td>\n",
       "      <td>东西有些小贵。</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2413</th>\n",
       "      <td>1</td>\n",
       "      <td>虽然时间长了些但是很准时。下次记得给些番茄酱就更好了。,一个人吃足够了。好好吃</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11937</th>\n",
       "      <td>0</td>\n",
       "      <td>11点以前就定的餐，做了1小时48分钟，呵呵，我只想说：拜拜！！！</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1024</th>\n",
       "      <td>1</td>\n",
       "      <td>很好吃，面皮特别有嚼劲儿，酱料也很好吃</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "       label                                             review\n",
       "10436      0                             难吃～石锅拌饭居然没酱～而且刚好晚了29分钟\n",
       "10468      0  等了很久，没关系，毕竟还在约定时间内，可是最让我忍不了的是真的很一般，个人口味吧，反正不和我...\n",
       "1643       1                                          嗯，纸袋比较高大上\n",
       "8723       0                                     海参怎么是生的，没法吃，郁闷\n",
       "2431       1                                     送餐很快，送餐人员很热情！～\n",
       "5121       0                                  不如以前好吃，肘子都有味儿了！哎！\n",
       "10565      0                                            东西有些小贵。\n",
       "2413       1            虽然时间长了些但是很准时。下次记得给些番茄酱就更好了。,一个人吃足够了。好好吃\n",
       "11937      0                  11点以前就定的餐，做了1小时48分钟，呵呵，我只想说：拜拜！！！\n",
       "1024       1                                很好吃，面皮特别有嚼劲儿，酱料也很好吃"
      ]
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "waimai_10k_ba_4000 = get_balance_corpus(4000, pd_positive, pd_negative)\n",
    "\n",
    "waimai_10k_ba_4000.sample(10)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.4"
  },
  "widgets": {
   "state": {},
   "version": "1.1.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
